Flight of the PEGASUS? Comparing Transformers on Few-shot and Zero-shot Multi-document Abstractive Summarization

Recent work has shown that pre-trained Transformers obtain remarkable performance on many natural language processing tasks, including automatic summarization. However, most work has focused on (relatively) data-rich single-document summarization settings. In this paper, we explore highly abstractive multi-document summarization where the summary is explicitly conditioned on a user-given topic statement or question. We compare the summarization quality produced by three state-of-the-art transformer-based models: BART, T5, and PEGASUS. We report the performance on four challenging summarization datasets, three from the general domain and one from consumer health, in both zero-shot and few-shot learning settings. While prior work has shown significant differences in performance for these models on standard summarization tasks, our results indicate that with as few as 10 labeled examples there is no statistically significant difference in summary quality, suggesting the need for more abstractive benchmark collections when determining state-of-the-art.


Introduction
Since its inception (Luhn, 1958), automatic summarization has focused on summarizing documents either in a generic way, conveying the main points of the document to any reader regardless of their information need, or in a task-specific way, distilling the important points of the document with respect to a specific information need such as a question or topic statement (Mani, 2009). In the latter case, the selection of the most salient points in the document (i.e., content selection) as well as the expression of those points (i.e., surface realization) must be explicitly conditioned on a user-given natural language context statement, such as a question or topic of interest. In this setting, a single passage may be summarized in different ways depending on the context description. Consequently, obtaining reference summaries is often time- or cost-prohibitive, particularly when dealing with specialized domains such as healthcare.
Topic-driven summarization has been explored at the Document Understanding Conference (DUC) and its successor, the Text Analysis Conference (TAC), which both ran community evaluations of topic- or question-based summarization. Specifically, participants were asked to develop automatic summarization approaches for generating single- or multi-document summaries that summarized a set of documents with respect to a given topic description or question, as shown in Figure 1. Human assessors manually judged submitted summaries.
In this work, we revisit the multi-document topic-driven abstractive summarization datasets produced from DUC 2007, TAC 2009, and TAC 2010, as well as question-driven summarization from consumer health. Because these datasets are relatively small (approximately 45 topics each), we explore modern transformer-based models' performance in the zero-shot and few-shot (10 examples) learning settings. Specifically, we explore the quality of multi-document abstractive summarization generated by T5 (Raffel et al., 2019), BART (Lewis et al., 2019), and PEGASUS.
Figure 1: Example topic- and question-driven multi-document abstractive summaries (documents omitted).

Background
Recent work has indicated that transfer learning (pre-training a model on data-rich tasks before fine-tuning it on a downstream task) obtains remarkable performance on many natural language processing tasks (Dong et al., 2019; Liu et al., 2019b). The most successful models are obtained through self-supervised pre-training with massive datasets to obtain transferable knowledge for new tasks (i.e., fine-tuning) with less abundant data (Devlin et al., 2019; Lewis et al., 2019; Keskar et al., 2019; Raffel et al., 2019). More recently, research has indicated that these models can generate language conditioned on a user-given prompt or context. For example, this prompt can guide the model's content selection towards a particular topic (Keskar et al., 2019) or inform surface realization for a specific task (Lewis et al., 2019; Raffel et al., 2019). In Liu et al. (2020), the authors condition an extractive transformer using "control codes" to specify the position, importance, and diversity of the sentences in the source text. In this work, we adapt this paradigm to train and evaluate BART, T5, and PEGASUS for abstractive multi-document summarization.
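To make the idea of prompt conditioning concrete, the following is a minimal sketch of how a topic statement or question might be combined with a document set into one input sequence. The "context:" and "documents:" markers and the function name build_conditioned_input are hypothetical choices made for illustration; the exact input format used in the experiments is not specified here.

```python
from typing import List


def build_conditioned_input(context: str, documents: List[str]) -> str:
    """Prepend a topic statement or question to the concatenated documents.

    The "context:" / "documents:" markers are illustrative assumptions,
    not a format taken from the paper.
    """
    # Collapse runs of whitespace so line breaks in the raw text do not
    # leak into the model input.
    context = " ".join(context.split())
    docs = [" ".join(d.split()) for d in documents if d.strip()]
    return "context: " + context + " documents: " + " ".join(docs)


example = build_conditioned_input(
    "What causes childhood obesity?",
    ["Diet matters.\n", "  Exercise   matters. "],
)
print(example)
```

Because the context appears at the start of the sequence, it remains in the input even if the concatenated documents have to be truncated to fit a model's maximum input length.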
Although zero-shot learning (ZSL) has received considerable attention in the image processing community, there has been comparatively little work on zero-shot learning specifically for summarization: Duan et al. (2019) explored zero-shot learning for cross-lingual sentence summarization, and Liu et al. (2019a) explored zero-shot abstractive summaries of five-sentence stories. We extend these works by evaluating zero-shot and few-shot learning (FSL) for multi-document abstractive summarization.

Models
In this work, we compare three of the most prominent conditional language generation models: T5, BART, and PEGASUS. To facilitate comparison, for each model we chose the variant with the most similar architecture (such that each consists of 12 transformer layers and a similar number of learnable parameters). Each model is pre-trained with unique strategies as described below.
BART (Bidirectional and Auto-Regressive Transformers) is pre-trained on document rotation, sentence permutation, text-infilling, and token masking and deletion objectives (Lewis et al., 2019). In our experiments, we used BART-Large.
T5 (Text-to-Text Transfer Transformer) is pre-trained on several unsupervised and supervised objectives, such as token and span masking, as well as translation, classification, reading comprehension, and summarization. Importantly, each objective is treated as a language-generation task, where the model is conditioned to generate the correct output based on a textual prompt included in the input sequence (Raffel et al., 2019). In this work, we used T5-Base.
PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence) was specifically designed for abstractive summarization and is pre-trained with a self-supervised gap-sentence-generation objective. In this task, entire sentences are masked from the source document, concatenated, and used as the target "summary". We used PEGASUS-Base in our experiments.
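The gap-sentence-generation objective described above can be illustrated with a short sketch. It assumes the sentences to mask have already been chosen and are passed in as indices; how those sentences are selected is not covered here. The mask string and the function name gap_sentence_example are placeholders for illustration, not the tokens or code used by the actual model.

```python
from typing import Sequence, Tuple

MASK_TOKEN = "<mask_1>"  # placeholder string, assumed for illustration


def gap_sentence_example(
    sentences: Sequence[str], gap_indices: Sequence[int]
) -> Tuple[str, str]:
    """Build one (source, target) pair for gap-sentence generation.

    Each selected sentence is replaced by MASK_TOKEN in the source, and the
    removed sentences are concatenated in document order to form the target.
    """
    gaps = set(gap_indices)
    source = " ".join(
        MASK_TOKEN if i in gaps else s for i, s in enumerate(sentences)
    )
    target = " ".join(sentences[i] for i in sorted(gaps))
    return source, target


doc = [
    "A storm hit the coast.",
    "Thousands lost power.",
    "Crews restored service by Friday.",
]
source, target = gap_sentence_example(doc, [1])
print(source)
print(target)
```

The pair resembles a summarization example (a long input and a short output drawn from it), which is why this objective is considered well matched to abstractive summarization.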

Answer Summarization at DUC 2007
The 2007 challenge of the Document Understanding Conference (DUC) focused on answering 45 natural language questions by summarizing sets of 10 documents from the AQUAINT English news corpus (Graff, 2002). Reference summaries were between 230 and 250 words. We used 30 topics for testing (with 10 for training and 5 for validation under FSL).
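The 10/5/30 partition of the 45 topics can be sketched as follows. The seeded random shuffle is an assumption made for illustration; the text does not state how topics were assigned to each partition, and few_shot_split is a hypothetical helper name.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def few_shot_split(
    topic_ids: Sequence[Hashable],
    n_train: int = 10,
    n_valid: int = 5,
    seed: int = 0,
) -> Tuple[List[Hashable], List[Hashable], List[Hashable]]:
    """Partition topics into few-shot train, validation, and test sets.

    Random assignment with a fixed seed is an assumption; every topic not
    used for training or validation is held out for testing.
    """
    if n_train + n_valid >= len(topic_ids):
        raise ValueError("not enough topics left over for testing")
    shuffled = list(topic_ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = few_shot_split(range(45))
print(len(train), len(valid), len(test))
```

In the zero-shot setting the training and validation topics are simply unused, so both settings can be evaluated on the same held-out test topics.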

Update Summarization at TAC 2009
In 2009, the Text Analysis Conference (TAC) summarization evaluation explored summarizing sets of 10 newswire articles with respect to a given topic description in approximately 100 words under the assumption that a user had already read a given set of earlier articles (Dang and Owczarzak, 2009).

Guided Summarization at TAC 2010
Similar to the 2009 evaluation, the summarization track's goal in TAC 2010 was to produce 100-word summaries of sets of 10 newswire articles for 46 given topics. However, in 2010 each topic was assigned to one of five pre-defined categories, and summaries were expected to cover all aspects associated with that category (e.g., for Accidents and Natural Disasters, summaries should cover (a) what happened, (b) when it happened, (c) the reasons for the accident or disaster, (d) casualties, (e) damages, and (f) rescue efforts or countermeasures) (Owczarzak and Dang, 2010). We used 30 topics for testing (with 10 for training and 6 for validation). Results are illustrated in Table 3. In this case, BART had the highest performance in both ZSL and FSL settings, although FSL provided significant improvements for all models, allowing T5 to obtain similar ROUGE-2 and ROUGE-L performance.
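For readers unfamiliar with ROUGE-2, the sketch below computes a simplified bigram-overlap F1 score between a candidate summary and one reference. It lowercases and splits on whitespace only, with no stemming, stopword handling, or multi-reference aggregation, so it is an illustration of the idea and should not be assumed to reproduce the scores of the official ROUGE toolkit.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count adjacent token pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram,
    # i.e. clipped overlap.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge2_f1("the cat sat on the mat", "the cat sat")
print(round(score, 4))
```

In this example the candidate contains five bigrams and the reference two, both of which are matched, giving a precision of 0.4 and a recall of 1.0.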

MEDIQA Summarization
The MEDIQA collection contains consumer health questions, sets of passages extracted from reliable websites relevant to the question, and human-authored multi-document summaries of the passages intended to provide consumer-friendly answers (Savery et al., 2020). Of the 156 available abstractive multi-document summaries, we used 141 questions for testing (with 10 for training and 5 for validation under FSL). Table 4 provides these results. While FSL provided a clear improvement for all models, there were no statistically significant differences in summary quality between the three models using FSL. Example summaries from all systems for a single MEDIQA question are provided in Figure 3.

Conclusion
We evaluated the summarization quality produced by three state-of-the-art transformers (BART, T5, and PEGASUS) on four challenging summarization datasets in both zero-shot and few-shot learning settings. Our results indicate that, while there are statistically significant differences between the models in zero-shot settings, after few-shot learning with as few as 10 examples, there is little discernible difference between them. This suggests that while large improvements have been made on standard single-document benchmarks, highly abstractive multi-document summarization remains challenging.

There are many factors that contribute to increasing your child's risk of becoming overweight. Regularly eating high calorie foods can cause your child to gain weight. Lack of physical exercise can contribute to childhood obesity. If your child comes from a family of overweight people, he or she may be more likely to put on weight. Personal, parental, and family stress can also increase a child's risk of obesity.

BART (FSL):
There are many risk factors for overweight and obesity. Some risk factors can be changed, such as age, family history and genetics, race and ethnicity, and sex. However, some risk factors cannot be changed. These include: poor diet, lack of physical activity, unhealthy eating patterns, not enough sleep, and high amounts of stress. social factors such as having a low socioeconomic status or an unhealthy social or unsafe environment in the neighborhood. Eating too much saturated and trans fats, eating foods high in added sugars, and having an unhealthy eating pattern can increase your risk for obesity. Avoiding foods that are high in fat and saturated fat, eating smaller meals, and reducing screen time

T5 (ZSL):
Unhealthy lifestyle habits Lack of physical activity, unhealthy eating patterns, not enough sleep, and high amounts of stress can increase your risk for overweight and obesity . age Childhood obesity remains a serious problem in the united states, and some populations are more at risk for childhood obesity than others .

PEGASUS (FSL):
Lifestyle changes can decrease your risk for developing overweight and obesity. Unhealthy lifestyle habits, such as eating more calories than you use, and not enough sleep can increase your risk for overweight and obesity. Lack of physical activity due to high amounts of TV, computer, videogame or other screen usage has been associated with a high body mass index. Healthy lifestyle changes, such as being physically active and reducing screen time, can help you aim for a healthy weight.

PEGASUS (ZSL):
Unhealthy lifestyle habits Lack of physical activity, unhealthy eating patterns, not enough sleep, and high amounts of stress can increase your risk for overweight and obesity. Unhealthy environments Many environmental factors can increase your risk for overweight and obesity: social factors such as having a low socioeconomic status or an unhealthy social or unsafe environment in the neighborhood built environment factors such as easy access to unhealthy fast foods, limited access to recreational facilities or parks, and few safe or easy ways to get exercise. Read more Overweight and obesity can be prevented by eating healthy foods, getting enough sleep, and reducing stress. Read more Overweight and obesity can be prevented by eating healthy foods, getting enough sleep, and reducing stress. Read